Morpho-syntactic labelling of an oral corpus by decomposing labels

Author

  • Isabelle Tellier
Abstract

A morpho-syntactic tagger assigns to each word of a corpus a label that summarizes its morpho-syntactic properties in the text. In corpora of oral data, we face not only the usual problem of words that admit several labels, but also the more specific problems of disfluencies (repetitions, ungrammatical constructions...), of non-existent words and of the lack of punctuation marks [1]. First, inspired by the Cordial tagger, we have defined a new set of morpho-syntactic labels well adapted to oral corpora. These labels are hierarchically defined according to three different levels: a POS level (L0, with 16 different labels), a morphological level (L1, with 72 labels) and a syntactico-semantic level (L3, with 107 labels). Then, we have built a reference corpus for our new set of labels. It was obtained by running Cordial, then modifying its labels with scripts and manual corrections. The learning corpus finally contains 18 500 words belonging to 1 750 distinct “sequences”.
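As a rough illustration of what a hierarchically defined label can look like, the following Python sketch represents a label as three nested components (POS, morphological, syntactico-semantic) and projects it onto a coarser level by dropping the finer components. The tag values (`DET`, `ms`, `def`, ...), the dot-separated notation, the depth indices 0 to 2 and the names `Label`, `project` and `project_sequence` are all assumptions made for this example; they are not taken from the tag set described above or from Cordial.

```python
from typing import List, NamedTuple, Optional, Tuple


class Label(NamedTuple):
    """A hierarchical label; the component values used below are hypothetical."""
    pos: str                      # coarsest component: part of speech
    morph: Optional[str] = None   # morphological component
    synsem: Optional[str] = None  # syntactico-semantic component


def project(label: Label, depth: int) -> str:
    """Truncate a label to the given depth (0 = POS only, 2 = full label)."""
    parts = [label.pos, label.morph, label.synsem][: depth + 1]
    # Skip components that are not set for this label.
    return ".".join(p for p in parts if p is not None)


def project_sequence(
    tagged: List[Tuple[str, Label]], depth: int
) -> List[Tuple[str, str]]:
    """Re-label a tagged sequence at a coarser or finer depth."""
    return [(word, project(label, depth)) for word, label in tagged]


# A toy tagged sequence with invented tag values.
tagged = [
    ("le", Label("DET", "ms", "def")),
    ("chat", Label("N", "ms", "common")),
    ("dort", Label("V", "3s", "pres")),
]

coarse = project_sequence(tagged, 0)  # POS-only view of the sequence
full = project_sequence(tagged, 2)    # fully specified labels
```

Keeping the components separate means the same annotated corpus can be read at any of the three levels without being re-annotated.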

Related resources

Sequential Patterns of POS Labels Help to Characterize Language Acquisition

In this paper, we try to characterize the various steps by which children acquire the syntax of their native language, using emerging sequential patterns of Part Of Speech (POS) labels. To achieve this goal, we first build a set of corpora from the French part of the CHILDES database. Then, we study the linguistic utterances of the children of various ages with tools coming from Natural Language Pro...

Etiqueter un corpus oral par apprentissage automatique à l'aide de connaissances linguistiques

Thanks to the Eslo1 (« Enquête sociolinguistique d'Orléans », i.e. « Sociolinguistic Inquiry of Orléans ») campaign, a large oral corpus has been gathered and transcribed in a textual format. The purpose of the work presented here is to associate a morpho-syntactic label to each unit of this corpus. To this aim, we have first studied the specificities of the necessary labels, and their various po...

M = Syntax + Prosody: A syntactic-prosodic labelling scheme for large spontaneous speech databases

In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation), we developed a syntactic-prosodic labelling scheme ...

Phrase-boundary translation model using shallow syntactic labels

Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...

Comparative study of oral and written French automatically tagged with morpho-syntactic information

In this paper, we investigate automatic tagging of French corpora and compare morpho-syntactic properties of spoken and written language on corpora from different sources. Morpho-syntactic properties are first described according to the distribution of the 8 main POS in five corpora of about 1 million words each. The automatic tagging was made with about a hundred tags and we will describe the ...

Publication date: 2010